Random Sampling Distributions

Binomial and negative binomial distributions are useful for modeling the outcomes of random sampling. In the context of scRNA-seq, they can help determine how many cells must be sequenced in order to capture a certain number of cells from a rare subtype. They can also help us predict how many cells of a rare type we expect to capture when sequencing a given number of cells.

Put simply, if we want to capture r cells of a certain type with a given confidence, we can use the negative binomial distribution to determine how many cells we need to sequence overall. If we can only sequence k cells, we can use the binomial distribution to determine the range of r that we expect to capture with high confidence.
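The link between the two distributions can be written out as a short sketch. Here X (the number of target cells among k sequenced cells) and k* (the smallest sufficient k) are symbols introduced for illustration, and the 0.95 threshold is just the confidence level used in the examples that follow.

```latex
X \sim \mathrm{Binomial}(k, p), \qquad
P(X \ge r) = 1 - \sum_{i=0}^{r-1} \binom{k}{i} p^{i} (1-p)^{k-i}

k^{*} = \min \{\, k : P(X \ge r) \ge 0.95 \,\}
```

The event "at least r target cells among k sequenced cells" is the same as "the r-th target cell appears within the first k cells", which is why the negative binomial distribution answers the first question and the binomial distribution answers the second.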

Sequencing of the Murine Colon

Let’s dive into a real-world example. In the mouse colon, tuft cells and enteroendocrine cells are the rarest major cell types within the epithelium, both with an abundance of ~1% (p = 0.01). Assuming that 50 cells are required to detect each of these cell types bioinformatically, we can use the negative binomial distribution to model the number of cells we need to sequence in order to capture 50 of each. In this case, profiling 6,466 cells would allow us to capture at least 50 tuft cells and at least 50 enteroendocrine cells with 95% probability. Keep in mind that this number does not include cells lost to quality control following sequencing.
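One way to approach the two-cell-type calculation is sketched below in Python with SciPy. This is an assumption about how such a figure could be derived, not necessarily the method used for the number above: it treats the counts of the two cell types as multinomial and searches for the smallest number of cells at which both targets are met with 95% probability. The function names are hypothetical, and the result may differ somewhat from the quoted figure depending on how the joint requirement is modeled.

```python
import numpy as np
from scipy.stats import binom


def prob_both_captured(n_cells, p, r):
    """P(at least r cells of type A AND at least r cells of type B).

    Both types are assumed to have abundance p. Given x cells of type A,
    the type B count among the remaining n_cells - x cells is
    Binomial(n_cells - x, p / (1 - p)).
    """
    x = np.arange(r, n_cells - r + 1)                 # feasible type A counts
    p_a = binom.pmf(x, n_cells, p)                    # P(A = x)
    p_b = binom.sf(r - 1, n_cells - x, p / (1 - p))   # P(B >= r | A = x)
    return float(np.sum(p_a * p_b))


def cells_needed_for_both(p, r, confidence=0.95):
    """Smallest n_cells with prob_both_captured(n_cells, p, r) >= confidence."""
    lo = 2 * r - 1            # too few cells to hold r of each type
    hi = int(2 * r / p)       # starting guess for an upper bound
    while prob_both_captured(hi, p, r) < confidence:
        hi *= 2
    # The probability only grows with n_cells, so bisection is valid.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prob_both_captured(mid, p, r) >= confidence:
            hi = mid
        else:
            lo = mid
    return hi


n_both = cells_needed_for_both(p=0.01, r=50, confidence=0.95)
print(f"Cells to sequence for >= 50 of each type: {n_both}")
```

Any expected loss to quality control would need to be added on top of the value this returns.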

Let’s try another example. F4/80+Ly6Chigh macrophages in the mouse colon have an abundance of ~2.5% (p = 0.025) within the myeloid cell population. Assume that we would like to capture 250 of these cells. The negative binomial distribution tells us that profiling 11,049 cells would allow us to capture at least 250 of these macrophages with 95% probability. Again, this number does not account for post-sequencing cell loss due to quality control.
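For a single cell type, the calculation can be sketched directly with SciPy's negative binomial distribution. Note that `scipy.stats.nbinom` counts the non-target cells ("failures") seen before the r-th target cell, so the target count has to be added back to get the total. The variable names are illustrative, and the result should land close to the figure quoted above.

```python
from scipy.stats import binom, nbinom

p = 0.025          # abundance of the target cell type
r = 250            # number of target cells we want to capture
confidence = 0.95  # probability of reaching the target

# Smallest number of non-target cells f such that the r-th target cell
# has been seen within r + f sequenced cells with the required probability.
failures = nbinom.ppf(confidence, r, p)
cells_needed = int(failures) + r

# Cross-check with the binomial view: P(at least r target cells).
prob_at_least_r = binom.sf(r - 1, cells_needed, p)

print(f"Cells to sequence: {cells_needed}")
print(f"P(at least {r} target cells): {prob_at_least_r:.4f}")
```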

10x Genomics is a popular platform for conducting single-cell sequencing experiments. Their platform supports library preparation of 10,000 cells. Suppose we are using their platform and would like to get a sense of how many F4/80+Ly6Chigh macrophages we expect to recover. Using the binomial distribution, we can generate a standard curve to model the range of cells we could potentially capture. In this case, we can expect to profile 219 to 281 of these macrophages with 95% probability. Keep in mind that loading more cells onto droplet-based sequencing platforms is associated with higher cell doublet/multiplet rates.
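The range above can be approximated with a central 95% interval of the binomial distribution. The sketch below uses SciPy and assumes an equal-tailed interval, so its endpoints may differ by a cell or so from the quoted range; the variable names are illustrative.

```python
from scipy.stats import binom

n_cells = 10_000   # cells going into library preparation
p = 0.025          # abundance of the target cell type

# Equal-tailed interval containing 95% of the probability mass.
low, high = binom.interval(0.95, n_cells, p)
expected = binom.mean(n_cells, p)

print(f"Expected target cells: {expected:.0f}")
print(f"95% interval: {int(low)} to {int(high)}")
```

Evaluating `binom.pmf` over the same range of counts would give the full curve rather than just its endpoints.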